Not on Colab: skipping Drive mount.
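The notice above suggests the notebook checks whether it is running on Google Colab before mounting Drive. The notebook's own code is not shown, so the following is only a minimal sketch of one common way to do that; the helper names `running_in_colab` and `mount_drive_if_colab` and the default mount point are assumptions.

```python
import importlib.util


def running_in_colab() -> bool:
    """Return True when the google.colab package can be found."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # The parent package "google" is not installed at all.
        return False


def mount_drive_if_colab(mount_point: str = "/content/drive") -> bool:
    """Mount Google Drive on Colab; otherwise print a notice and skip."""
    if running_in_colab():
        from google.colab import drive  # importable only inside Colab

        drive.mount(mount_point)
        return True
    print("Not on Colab: skipping Drive mount.")
    return False
```

Outside Colab, `mount_drive_if_colab()` prints the notice and returns `False`, so the same notebook can run locally without changes.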
Load Libraries¶
dataset_modified: (39235, 62)
Load data¶
✓ Loaded data/processed/clean.csv -> (39235, 62)
dtypes (first 8):
url string[python]
timedelta Int64
n_tokens_title Float64
n_tokens_content Float64
n_unique_tokens Float64
n_non_stop_words Float64
n_non_stop_unique_tokens Float64
num_hrefs Float64
dtype: object
| | url | timedelta | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | ... | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | shares | mixed_type_col |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://mashable.com/2013/01/07/amazon-instant-... | 731 | 12.0 | 219.0 | 0.663594 | 1.0 | 0.815385 | 4.0 | 2.0 | 1.0 | ... | 0.7 | -0.35 | -0.6 | -0.1 | 0.5 | -0.1875 | 0.0 | 0.1875 | 593.0 | 493 |
| 1 | http://mashable.com/2013/01/07/ap-samsung-spon... | 731 | 9.0 | 255.0 | 0.604743 | 1.0 | 0.791946 | 3.0 | 1.0 | 1.0 | ... | 0.7 | -0.11875 | -0.125 | -0.1 | 0.0 | 0.0 | 0.5 | 0.0 | 711.0 | 639 |
| 2 | http://mashable.com/2013/01/07/apple-40-billio... | 731 | 9.0 | 211.0 | 0.57513 | 1.0 | 0.663866 | 3.0 | 1.0 | 1.0 | ... | 1.0 | -0.466667 | -0.8 | -0.133333 | 0.0 | 0.0 | 0.5 | 0.0 | 1500.0 | 493 |
| 3 | http://mashable.com/2013/01/07/astronaut-notre... | 731 | 9.0 | 531.0 | 0.503788 | 1.0 | 0.665635 | 9.0 | 0.0 | 1.0 | ... | 0.8 | -0.369697 | -0.6 | -0.166667 | 0.0 | 0.0 | 0.5 | 0.0 | 1200.0 | 688 |
| 4 | http://mashable.com/2013/01/07/att-u-verse-apps/ | 731 | 13.0 | 1072.0 | 0.415646 | 1.0 | 0.54089 | 19.0 | 12.842425 | 20.0 | ... | 1.0 | -0.220192 | -0.5 | -0.05 | 0.454545 | 0.136364 | 0.045455 | 0.136364 | 505.0 | 579 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 39230 | http://mashable.com/2014/12/27/samsung-app-aut... | 8 | 16.575486 | 346.0 | 0.529052 | 1.0 | 0.684783 | 9.0 | 7.0 | 1.0 | ... | 0.75 | -0.26 | -0.5 | -0.125 | 0.1 | 0.0 | 0.4 | 0.0 | 1800.0 | 253 |
| 39231 | http://mashable.com/2014/12/27/seth-rogen-jame... | 8 | 12.0 | 328.0 | 0.696296 | 1.0 | 0.885057 | 9.0 | 7.0 | 3.0 | ... | 0.7 | -0.211111 | -0.4 | -0.1 | 0.3 | 1.0 | 0.2 | 1.0 | 1900.0 | 493 |
| 39232 | http://mashable.com/2014/12/27/son-pays-off-mo... | 8 | 10.0 | 442.0 | 0.516355 | 1.0 | 0.644128 | 24.0 | 1.0 | 12.0 | ... | 0.5 | -0.356439 | -0.8 | -0.166667 | 0.454545 | 0.136364 | 0.045455 | 0.136364 | 1900.0 | 555 |
| 39233 | http://mashable.com/2014/12/27/ukraine-blasts/ | 8 | 6.0 | 682.0 | 0.539493 | 1.0 | 0.692661 | 10.0 | 1.0 | 1.0 | ... | 0.5 | -0.253332 | -0.5 | -0.0125 | 0.0 | 0.0 | 0.5 | 0.0 | 1100.0 | 493 |
| 39234 | http://mashable.com/2014/12/27/youtube-channel... | 8 | 10.0 | 1822.919305 | 0.701987 | 1.0 | 0.846154 | 1.0 | 1.0 | 0.0 | ... | 0.5 | -0.2 | -0.2 | -0.2 | 0.333333 | 0.25 | 0.166667 | 0.25 | 1300.0 | 703 |
39235 rows × 62 columns
url string[python]
timedelta Int64
n_tokens_title Float64
n_tokens_content Float64
n_unique_tokens Float64
...
title_sentiment_polarity Float64
abs_title_subjectivity Float64
abs_title_sentiment_polarity Float64
shares Float64
mixed_type_col Int64
Length: 62, dtype: object
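The dtype listing above shows pandas nullable dtypes (`string[python]`, `Int64`, `Float64`). The loading code is not shown, so the sketch below is only one way to end up with such dtypes: passing an explicit `dtype` mapping to `pd.read_csv`. The in-memory CSV and its rows are made up for illustration; only the column names come from the output above.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/processed/clean.csv (three invented rows).
csv_text = """url,timedelta,n_tokens_title,shares
http://example.com/a,731,12.0,593.0
http://example.com/b,731,9.5,711.0
http://example.com/c,8,,1300.0
"""

# Nullable dtypes keep missing values as <NA> instead of forcing float/object.
dtypes = {
    "url": "string",
    "timedelta": "Int64",
    "n_tokens_title": "Float64",
    "shares": "Float64",
}

df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(f"Loaded -> {df.shape}")
print(df.dtypes)
```

With an explicit mapping the result does not depend on type inference, which is useful when a column such as `shares` happens to contain only whole numbers.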
Step 1 EDA - Clean Dataframe and describe columns¶
Classes and functions to clean columns and insert into pipeline¶
Define column type¶
(1, 14, 47)
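The tuple above adds up to the 62 columns of the dataframe. One plausible reading, which is an assumption since the classification code is not shown, is 1 text column (`url`), 14 binary indicator columns, and 47 other numeric columns. A minimal sketch of such a split, with the hypothetical helper `split_column_types`:

```python
import pandas as pd


def split_column_types(df: pd.DataFrame):
    """Return (text, binary, numeric) lists of column names."""
    text_cols, binary_cols, numeric_cols = [], [], []
    for col in df.columns:
        series = df[col]
        if not pd.api.types.is_numeric_dtype(series):
            text_cols.append(col)
        elif set(series.dropna().unique()) <= {0, 1}:
            # Only the values 0 and 1 appear: treat as an indicator column.
            binary_cols.append(col)
        else:
            numeric_cols.append(col)
    return text_cols, binary_cols, numeric_cols


# Invented rows; the column names are taken from the dataset.
demo = pd.DataFrame({
    "url": ["http://example.com/a", "http://example.com/b"],
    "is_weekend": [0, 1],
    "weekday_is_friday": [1, 0],
    "n_tokens_title": [12.0, 9.0],
    "shares": [593.0, 711.0],
})
text_cols, binary_cols, numeric_cols = split_column_types(demo)
print(len(text_cols), len(binary_cols), len(numeric_cols))
```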
Classify numeric columns¶
(18, 2, 3)
Preprocess columns¶
Original shape : (39235, 62)
Cleaned shape : (39235, 62)
Any NA in url? : False
Duplicate urls : False
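The two `url` checks reported above can be reproduced with plain pandas calls. This is a small sketch on invented rows; the helper name `url_integrity` is hypothetical.

```python
import pandas as pd


def url_integrity(df: pd.DataFrame, col: str = "url") -> dict:
    """Report whether the key column has missing or duplicated values."""
    return {
        "any_na": bool(df[col].isna().any()),
        "any_duplicates": bool(df[col].duplicated().any()),
    }


demo = pd.DataFrame({
    "url": ["http://example.com/a", "http://example.com/b"],
    "shares": [593.0, 711.0],
})
report = url_integrity(demo)
print(f"Any NA in url? : {report['any_na']}")
print(f"Duplicate urls : {report['any_duplicates']}")
```

Both flags being `False` is what makes `url` usable as a unique row identifier.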
Describe the columns¶
LDA_00 float64
LDA_01 float64
LDA_02 float64
LDA_03 float64
LDA_04 float64
...
weekday_is_friday int64
weekday_is_saturday int64
weekday_is_sunday int64
is_weekend int64
url object
Length: 62, dtype: object
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LDA_00 | 39235.0 | NaN | NaN | NaN | 0.1844 | 0.261623 | 0.010289 | 0.025182 | 0.033402 | 0.240389 | 0.926994 |
| LDA_01 | 39235.0 | NaN | NaN | NaN | 0.142881 | 0.220038 | 0.01029 | 0.025032 | 0.033348 | 0.156095 | 0.925947 |
| LDA_02 | 39235.0 | NaN | NaN | NaN | 0.215967 | 0.28094 | 0.010005 | 0.028572 | 0.040007 | 0.327956 | 0.919999 |
| LDA_03 | 39235.0 | NaN | NaN | NaN | 0.222839 | 0.293553 | 0.010838 | 0.028572 | 0.040001 | 0.366368 | 0.926534 |
| LDA_04 | 39235.0 | NaN | NaN | NaN | 0.233914 | 0.28832 | 0.010679 | 0.02858 | 0.047619 | 0.396494 | 0.927191 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| weekday_is_friday | 39235.0 | NaN | NaN | NaN | 0.140232 | 0.347232 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| weekday_is_saturday | 39235.0 | NaN | NaN | NaN | 0.060176 | 0.237815 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| weekday_is_sunday | 39235.0 | NaN | NaN | NaN | 0.06721 | 0.250389 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| is_weekend | 39235.0 | NaN | NaN | NaN | 0.127386 | 0.333409 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| url | 39235 | 39235 | http://mashable.com/2013/01/07/amazon-instant-... | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
62 rows × 11 columns
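The 11 statistics columns in the table above match what `DataFrame.describe(include="all")` returns once transposed. The sketch below assumes that is how the table was produced and uses three invented rows.

```python
import pandas as pd

demo = pd.DataFrame({
    "LDA_00": [0.50, 0.03, 0.24],
    "is_weekend": [0, 1, 0],
    "url": [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.com/c",
    ],
})

# include="all" adds unique/top/freq for text columns; .T puts one row per column.
summary = demo.describe(include="all").T
print(summary.shape)
print(summary)
```

Numeric columns show `NaN` under `unique`, `top`, and `freq`, and the text column shows `NaN` under the numeric statistics, which is the pattern visible in the table above.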
Step 2 EDA - graphs¶
Function to graph¶
/tmp/ipykernel_103080/1188193524.py:43: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.
sns.countplot(x=df[col],
ML model¶
Split dataframe for Train, Validation and Test and drop columns that are redundant¶
Add correlation analysis¶
The following columns were removed from X¶
- average_token_length, since it is an average correlated with n_tokens_content
- kw_avg_min, since it is an average correlated with kw_min_min and kw_max_min
- kw_avg_max, since it is an average correlated with kw_max_min and kw_max_max
- self_reference_min_shares, since this data is normally obtained only after a model has been deployed
- self_reference_max_shares, since this data is normally obtained only after a model has been deployed
- self_reference_avg_sharess, since this data is normally obtained only after a model has been deployed
- is_weekend, since we already have columns for Saturday and Sunday
- weekday_is_sunday, since it can be inferred when none of the other weekday columns apply
- avg_positive_polarity, since it is an average correlated with min_positive_polarity and max_positive_polarity
- avg_negative_polarity, since it is an average correlated with min_negative_polarity and max_negative_polarity
- url, since it is a string column that can be used as the index
- shares, since that is our output
- kw_avg_avg, since it is correlated with kw_max_avg and kw_min_avg
- timedelta, since it is not a predictive column
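The list above names 14 columns, which accounts for the drop from 62 columns to the 48 features reported for X. A minimal sketch of that step on a toy dataframe; the constant name `COLUMNS_TO_DROP` and the two kept demo columns are chosen for illustration.

```python
import pandas as pd

# The 14 columns listed above (names copied as they appear in the dataset).
COLUMNS_TO_DROP = [
    "average_token_length",
    "kw_avg_min",
    "kw_avg_max",
    "self_reference_min_shares",
    "self_reference_max_shares",
    "self_reference_avg_sharess",
    "is_weekend",
    "weekday_is_sunday",
    "avg_positive_polarity",
    "avg_negative_polarity",
    "url",
    "shares",
    "kw_avg_avg",
    "timedelta",
]

# Toy frame: every dropped column plus two features that are kept.
demo = pd.DataFrame(
    {name: [0, 1] for name in COLUMNS_TO_DROP + ["n_tokens_title", "num_hrefs"]}
)

y = demo["shares"]                        # target
X = demo.drop(columns=COLUMNS_TO_DROP)    # features
print(X.shape, y.shape)
```

Because `drop` raises a `KeyError` for a missing name by default, a misspelled column in the list is caught immediately instead of being skipped.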
Original dataframe (39235, 62)
Cleaned dataframe (39235, 62)
X_train (27464, 48)
X_val (5885, 48)
X_test (5886, 48)
y_train (27464,)
y_val (5885,)
y_test (5886,)
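The sizes above correspond to roughly a 70/15/15 split. The notebook's split code is not shown; the sketch below assumes two chained calls to scikit-learn's `train_test_split` (30% held out, then halved), which yields the same row counts. The zero-filled arrays and `random_state` value are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as X and y in the notebook.
X = np.zeros((39235, 48))
y = np.zeros(39235)

# Step 1: hold out 30% of the rows for validation + test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42
)
# Step 2: split the held-out rows in half.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42
)

for name, part in [("X_train", X_train), ("X_val", X_val), ("X_test", X_test)]:
    print(name, part.shape)
```

`train_test_split` rounds the test share up, which is why the test set ends up one row larger than the validation set.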
Pipeline to improve features distributions¶
48
Input data dimensions:
before applying the transformations: (27464, 48)
after applying the transformations: (27464, 48)
Histogram after pipeline¶
| | LDA_00 | LDA_01 | LDA_02 | LDA_03 | LDA_04 | abs_title_sentiment_polarity | abs_title_subjectivity | global_rate_negative_words | global_rate_positive_words | global_sentiment_polarity | ... | data_channel_is_bus | data_channel_is_socmed | data_channel_is_tech | data_channel_is_world | weekday_is_monday | weekday_is_tuesday | weekday_is_wednesday | weekday_is_thursday | weekday_is_friday | weekday_is_saturday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38677 | 1.549072 | -0.424829 | 0.846910 | 0.444943 | -0.734875 | 1.383421 | -0.772061 | 0.0 | 0.042486 | 1.293728 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5096 | 0.835936 | -0.551436 | -0.746486 | 1.556585 | -0.809549 | 1.383421 | -1.315197 | 0.0 | -1.852629 | -4.404387 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 26446 | -0.556042 | -0.426287 | 1.657392 | -0.650452 | -0.738698 | -0.883153 | 0.864344 | 0.0 | -0.004052 | 0.752190 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5588 | -0.419554 | -0.241316 | 1.493192 | -0.539757 | 0.723072 | -0.466910 | 0.316645 | 0.0 | 0.042486 | -0.169173 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 16614 | 0.704288 | -0.849119 | -0.915578 | 1.593365 | -0.958407 | -0.883153 | 0.864344 | 0.0 | 0.389315 | 1.048159 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 48 columns
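In the table above the continuous features look standardized while the indicator columns stay 0/1, and the shape is unchanged. The notebook does not show which transformers it uses, so the following sketch substitutes a Yeo-Johnson `PowerTransformer` for the numeric columns and passes the binary ones through; the data and column subsets are invented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "LDA_00": rng.beta(0.5, 2.0, size=200),                      # skewed in (0, 1)
    "num_hrefs": rng.lognormal(mean=2.0, sigma=1.0, size=200),   # long right tail
    "weekday_is_monday": rng.integers(0, 2, size=200),           # indicator
})
numeric_cols = ["LDA_00", "num_hrefs"]
binary_cols = ["weekday_is_monday"]

preprocess = ColumnTransformer([
    # Yeo-Johnson reduces skew and, by default, standardizes the result.
    ("numeric", PowerTransformer(method="yeo-johnson"), numeric_cols),
    ("binary", "passthrough", binary_cols),
])

transformed = preprocess.fit_transform(X_train)
# Output columns follow the transformer order: numeric first, then binary.
X_train_t = pd.DataFrame(
    transformed, columns=numeric_cols + binary_cols, index=X_train.index
)
X_train_t = X_train_t.astype({col: int for col in binary_cols})

print("before:", X_train.shape, "after:", X_train_t.shape)
```

Fitting the transformer on the training rows only and reusing it for validation and test data keeps information from the held-out sets out of the preprocessing.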
Merge X_train and X_test¶
Input variable dimensions BEFORE the transformations: (33350, 48)
Input variable dimensions AFTER the transformations: (33350, 48)
Functions¶
ML models¶
Linear regression¶
Find best parameters while using column transformation
Fitting 15 folds for each of 8 candidates, totalling 120 fits
>> Linear_Regression
Best RMSE (CV): 3974.5696 using {'model__copy_X': True, 'model__fit_intercept': True, 'model__positive': False}
------------------------------------------------------------------------------------------
rmse : 3974.5696 – 4081.8782 (std avg 90.238)
mae : 2340.2179 – 2381.9597 (std avg 32.343)
mape : 1.5913 – 1.6280 (std avg 0.325)
r2 : -0.0035 – 0.0486 (std avg 0.006)
------------------------------------------------------------------------------------------
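The search log above (15 folds, 8 candidates, `model__` parameter prefixes) is consistent with a scikit-learn `GridSearchCV` over a pipeline whose final step is named `model`. The sketch below rests on that assumption: the 15 folds are modeled as 5 splits repeated 3 times, the `StandardScaler` step stands in for the notebook's column transformation, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])

# 2 x 2 x 2 = 8 candidates, matching the log line.
param_grid = {
    "model__copy_X": [True, False],
    "model__fit_intercept": [True, False],
    "model__positive": [True, False],
}

# 5 splits x 3 repeats = 15 folds per candidate (one possible configuration).
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="neg_root_mean_squared_error",  # sklearn maximizes, so RMSE is negated
    cv=cv,
    verbose=1,
)
search.fit(X, y)
print(f"Best RMSE (CV): {-search.best_score_:.4f} using {search.best_params_}")
```

Note that `copy_X` does not change the fitted coefficients, so in practice only `fit_intercept` and `positive` can move the score.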
✅ Artifacts saved:
- models/Linear_Regression.joblib
- models/KNN.joblib
- reports/models/cv_results_summary.csv
- reports/models/metrics.json
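The artifact paths above suggest that fitted models are stored with joblib and metrics as JSON. A minimal sketch under that assumption; the helper `save_artifacts` is hypothetical, and a temporary directory is used so the example does not write into a real project tree.

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn.linear_model import LinearRegression


def save_artifacts(model, name: str, metrics: dict, root: Path) -> dict:
    """Persist a fitted model and its metrics under root; return the paths."""
    model_path = root / "models" / f"{name}.joblib"
    metrics_path = root / "reports" / "models" / "metrics.json"
    model_path.parent.mkdir(parents=True, exist_ok=True)
    metrics_path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, model_path)
    metrics_path.write_text(json.dumps(metrics, indent=2))
    return {"model": model_path, "metrics": metrics_path}


root = Path(tempfile.mkdtemp())
model = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 2.0, 4.0])
paths = save_artifacts(model, "Linear_Regression", {"rmse": 3974.5696}, root)
print("Artifacts saved:")
for path in paths.values():
    print(" -", path.relative_to(root))
```

A saved model can later be restored with `joblib.load(paths["model"])`, ideally using the same scikit-learn version that created it.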
k-Nearest Neighbors (kNN)¶
Find best parameters while using column transformation
Fitting 15 folds for each of 8 candidates, totalling 120 fits
>> K_neighbors_nearest
Best RMSE (CV): 4014.1910 using {'model__algorithm': 'auto', 'model__n_neighbors': 21, 'model__p': 1, 'model__weights': 'uniform'}
------------------------------------------------------------------------------------------
rmse : 4014.1910 – 4271.3955 (std avg 91.015)
mae : 2274.1715 – 2450.0750 (std avg 40.211)
mape : 1.4403 – 1.5850 (std avg 0.330)
r2 : -0.0990 – 0.0295 (std avg 0.010)
------------------------------------------------------------------------------------------
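The four metrics reported above (rmse, mae, mape, r2) can be computed with `sklearn.metrics`. The notebook's own metric code is not shown, so this is only a sketch; the helper `regression_metrics` and the three sample values are illustrative.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)


def regression_metrics(y_true, y_pred) -> dict:
    """Return RMSE, MAE, MAPE (as a fraction) and R^2 for one set of predictions."""
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "mape": float(mean_absolute_percentage_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }


y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
for name, value in regression_metrics(y_true, y_pred).items():
    print(f"{name:5s}: {value:.4f}")
```

scikit-learn reports MAPE as a fraction rather than a percentage. If the notebook follows the same convention, a value such as 1.44 means the average absolute error is about 144% of the true share count.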
Decision tree¶
Find best parameters while using column transformation
Fitting 15 folds for each of 12 candidates, totalling 180 fits
>> Decision_tree
Best RMSE (CV): 23400.2251 using {'model__criterion': 'absolute_error', 'model__max_depth': 7, 'model__max_features': 'sqrt'}
rmse : 23400.2251 – 39427.2034 (std avg 5531.391)
mae : 3138.4086 – 6596.0320 (std avg 385.893)
mape : 0.6508 – 3.4537 (std avg 0.298)
r2 : -2.5399 – -0.0223 (std avg 0.896)